Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation
We present our submission to the Microsoft Video to Language Challenge of
generating short captions describing videos in the challenge dataset. Our model
is based on the encoder-decoder pipeline, popular in image and video
captioning systems. We propose to utilize two different kinds of video
features, one to capture the video content in terms of objects and attributes,
and the other to capture the motion and action information. Using these diverse
features we train models specializing in two separate input sub-domains. We
then train an evaluator model which is used to pick the best caption from the
pool of candidates generated by these domain expert models. We argue that this
approach is better suited for the current video captioning task, compared to
using a single model, due to the diversity in the dataset.
The efficacy of our method is demonstrated by the fact that it was rated best in the MSR Video to Language Challenge according to human evaluation. Additionally, we were ranked second in the table based on automatic evaluation metrics.
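To make the candidate-pool selection concrete, here is a minimal sketch of an evaluator network choosing among captions produced by the domain-expert models. All names and dimensions are illustrative assumptions, not the submission's actual architecture:

```python
# Minimal sketch: pick the best caption from a candidate pool with a
# learned evaluator that scores (video, caption) feature pairs.
import torch
import torch.nn as nn

class CaptionEvaluator(nn.Module):
    """Scores how well a caption feature vector matches a video feature."""
    def __init__(self, video_dim=512, caption_dim=512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(video_dim + caption_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, video_feat, caption_feat):
        return self.scorer(torch.cat([video_feat, caption_feat], dim=-1))

def pick_best_caption(evaluator, video_feat, candidates):
    """candidates: list of (caption_text, caption_feat) from expert models."""
    scores = [evaluator(video_feat, feat).item() for _, feat in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best][0]
```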
Natural Language Description of Images and Videos
Understanding visual media, i.e., images and videos, has been a cornerstone topic in computer vision research for a long time. Recently, a new task within the purview of this research area, that of automatically captioning images and videos, has garnered widespread interest. The task involves generating a short natural language description of an image or a video.
This thesis studies the automatic visual captioning problem in its entirety. A baseline visual captioning pipeline is examined, including its two constituent blocks, namely visual feature extraction and language modeling. We then discuss the challenges involved and the methods available to evaluate a visual captioning system. Building on this baseline model, several enhancements are proposed to improve the performance of both the visual feature extraction and the language modeling. Deep convolutional neural network based image features used in the baseline model are augmented with explicit object and scene detection features.
In the case of videos, a combination of action recognition and static frame-level features is used. The long short-term memory network based language model used in the baseline is extended by introducing an additional input channel and residual connections. Finally, an efficient ensembling technique based on a caption evaluator network is presented.
Results from extensive experiments conducted to evaluate each of the above-mentioned enhancements are reported. The image and video captioning architectures proposed in this thesis achieve state-of-the-art performance on the corresponding tasks. To support these claims, results from two video captioning challenges organized over the last year are reported, both of which were won by the models presented in the thesis. We also quantitatively analyze the automatic captions generated and identify several shortcomings of the current system. Having identified these deficiencies, we briefly look at a few interesting problems which could take automatic visual captioning research forward.
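As a rough illustration of the extended language model described above, the sketch below shows an LSTM decoder with an additional (visual) input channel and a residual connection around the recurrent layer. Dimensions and names are assumptions for exposition, not the exact thesis architecture:

```python
import torch
import torch.nn as nn

class ResidualLSTMDecoder(nn.Module):
    """LSTM language model with an extra feature input channel and a
    residual connection (illustrative; embed_dim must equal hidden)."""
    def __init__(self, vocab_size, embed_dim=256, feat_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Extra channel: visual features are fed at every time step.
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, feats):
        emb = self.embed(tokens)                         # (B, T, E)
        feats = feats.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, feats], dim=-1))
        h = h + emb                                      # residual connection
        return self.out(h)                               # per-step vocab logits
```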
Not Using the Car to See the Sidewalk: Quantifying and Controlling the Effects of Context in Classification and Segmentation
The importance of visual context in scene understanding tasks is well recognized in the computer vision community. However, it is unclear to what extent computer vision models for image classification and semantic segmentation depend on context to make their predictions. A model that relies too heavily on context will fail when it encounters objects in context distributions different from the training data, so it is important to identify these dependencies before deploying models in the real world. We propose a method to quantify the
sensitivity of black-box vision models to visual context by editing images to
remove selected objects and measuring the response of the target models. We
apply this methodology to two tasks, image classification and semantic segmentation, and discover undesirable dependencies between objects and context,
for example that "sidewalk" segmentation relies heavily on "cars" being present
in the image. We propose an object-removal-based data augmentation solution to
mitigate this dependency and increase the robustness of classification and
segmentation models to contextual variations. Our experiments show that the
proposed data augmentation helps these models improve performance in out-of-context scenarios while preserving performance on regular data.
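The core measurement is simple: edit an object out of the image and compare the model's response before and after. A minimal sketch of such a probe for a black-box classifier follows; the confidence-drop proxy is our assumption, not necessarily the paper's exact metric:

```python
import torch

def context_sensitivity(model, image, image_without_object, target_class):
    """Confidence drop on `target_class` after a context object is removed.
    `image_without_object` is the edited image; a large positive drop
    means the prediction relied on the removed object as context."""
    model.eval()
    with torch.no_grad():
        p_full = torch.softmax(model(image.unsqueeze(0)), dim=-1)[0, target_class]
        p_edit = torch.softmax(model(image_without_object.unsqueeze(0)), dim=-1)[0, target_class]
    return (p_full - p_edit).item()
```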
Adversarial content manipulation for analyzing and improving model robustness
The recent rapid progress in machine learning systems has opened up many real-world applications, from recommendation engines on web platforms to safety-critical systems like autonomous vehicles. A model deployed in the real world will often encounter inputs far from its training distribution; for example, a self-driving car might come across a black stop sign in the wild. To ensure safe operation, it is vital to quantify the robustness of machine learning models to such out-of-distribution data before releasing them into the real world. However, the standard paradigm of benchmarking machine learning models with fixed-size test sets drawn from the same distribution as the training data is insufficient to identify these corner cases efficiently. In principle, if we could generate all valid variations of an input and measure the model response, we could quantify and guarantee model robustness locally. Yet doing this with real-world data is not scalable.

In this thesis, we propose an alternative: using generative models to create synthetic data variations at scale and testing the robustness of target models to these variations. We explore methods to generate semantic data variations in a controlled fashion across visual and text modalities. We build generative models capable of performing controlled manipulation of data, such as changing the visual context, editing the appearance of an object in an image, or changing the writing style of text. Leveraging these generative models, we propose tools to study the robustness of computer vision systems to input variations and to systematically identify failure modes. In the text domain, we deploy these generative models to improve the diversity of image captioning systems and to perform writing-style manipulation that obfuscates private attributes of the user.

Our studies quantifying model robustness explore two kinds of input manipulations: model-agnostic and model-targeted. Model-agnostic manipulations leverage human knowledge to choose the kinds of changes without considering the target model being tested; this includes automatically editing images to remove objects not directly relevant to the task and to create variations in visual context. In the model-targeted approach, by contrast, the input variations are directly adversarially guided by the target model. For example, we adversarially manipulate the appearance of an object in an image to fool an object detector, guided by the gradients of the detector. Using these methods, we measure and improve the robustness of various computer vision systems, specifically image classification, segmentation, object detection, and visual question answering systems, to semantic input variations.
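To illustrate the model-targeted setting, here is a deliberately simplified, pixel-space analogue of gradient-guided manipulation (the thesis instead steers generative edits of object appearance with the detector's gradients):

```python
import torch

def gradient_guided_step(model, x, label, loss_fn, step=0.01):
    """One ascent step along the input gradient that increases the target
    model's loss (a pixel-space stand-in for adversarially guided edits)."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), label).backward()
    return (x + step * x.grad.sign()).detach()
```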
Adversarial Scene Editing: Automatic Object Removal from Weak Supervision
While great progress has been made recently in automatic image manipulation,
it has been limited to object-centric images like faces or structured scene
datasets. In this work, we take a step towards general scene-level image
editing by developing an automatic interaction-free object removal model. Our
model learns to find and remove objects from general scene images using
image-level labels and unpaired data in a generative adversarial network (GAN)
framework. We achieve this with two key contributions: a two-stage editor
architecture consisting of a mask generator and an image in-painter that cooperate to remove objects, and a novel GAN-based prior for the mask
generator that allows us to flexibly incorporate knowledge about object shapes.
We experimentally show on two datasets that our method effectively removes a
wide variety of objects using weak supervision only.
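A minimal sketch of the two-stage editor's data flow, with the mask generator and in-painter left as placeholder modules (the actual sub-networks are trained adversarially as described above):

```python
import torch
import torch.nn as nn

class TwoStageEditor(nn.Module):
    """Object removal in two stages: localize with a mask, then in-paint."""
    def __init__(self, mask_generator: nn.Module, inpainter: nn.Module):
        super().__init__()
        self.mask_generator = mask_generator  # placeholder sub-network
        self.inpainter = inpainter            # placeholder sub-network

    def forward(self, image, target_class):
        mask = self.mask_generator(image, target_class)  # (B,1,H,W) in [0,1]
        erased = image * (1 - mask)                      # remove the object
        filled = self.inpainter(erased, mask)            # synthesize background
        return mask * filled + (1 - mask) * image        # composite output
```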
Dunbar syndrome: a rare presentation of abdominal angina treated by revascularization of the celiac artery by endovascular stenting
Median arcuate ligament syndrome (MALS) is a rare entity characterized by extrinsic compression of the celiac artery, with symptoms of postprandial epigastric pain, nausea, vomiting, and weight loss mimicking mesenteric ischemia. The following case illustrates a rare cause of abdominal pain: a young woman found to have celiac trunk stenosis secondary to compression of the trunk by the median arcuate ligament. She underwent successful stenting of the ostial celiac trunk, which relieved her symptoms. Decompression of the celiac artery is the general approach. Usually, once revascularisation is achieved after percutaneous transluminal angioplasty (PTA), 75% of patients remain asymptomatic at follow-up.
A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation
Text-based analysis methods can reveal privacy-relevant author attributes such as the gender, age, and identity of the text's author. Such methods
can compromise the privacy of an anonymous author even when the author tries to
remove privacy sensitive content. In this paper, we propose an automatic
method, called Adversarial Author Attribute Anonymity Neural Translation (A4NT), to combat such text-based adversaries. We combine
sequence-to-sequence language models used in machine translation and generative
adversarial networks to obfuscate author attributes. Unlike machine translation
techniques which need paired data, our method can be trained on unpaired
corpora of text containing different authors. Importantly, we propose and
evaluate techniques to impose constraints on our A4NT model to preserve the semantics of the input text. A4NT learns to make minimal changes to the
input text to successfully fool author attribute classifiers, while aiming to
maintain the meaning of the input. We show through experiments on two different
datasets and three settings that our proposed method is effective in fooling
the author attribute classifiers and thereby improving the anonymity of
authors.
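The generator objective can be summarized as a trade-off between fooling the attribute classifier and preserving meaning. The sketch below is our simplified reading; the weighting and the exact semantic term are assumptions, not the paper's formulation:

```python
import torch.nn.functional as F

def obfuscation_loss(attr_logits, target_wrong_attr, semantic_loss, alpha=1.0):
    """Encourage the translated text to be classified as the wrong attribute
    while a semantic term (e.g. reconstruction) keeps the meaning intact."""
    fooling_term = F.cross_entropy(attr_logits, target_wrong_attr)
    return fooling_term + alpha * semantic_loss
```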
Integrating artificial intelligence for knowledge management systems – synergy among people and technology: a systematic review of the evidence
This paper analyses Artificial Intelligence (AI) and Knowledge Management (KM), focusing primarily on examining to what degree AI can help companies in their efforts to handle information and manage knowledge effectively. A search was carried out across relevant electronic bibliographic databases and the reference lists of relevant review articles. Articles were screened, and eligibility was assessed using the participants, procedures, comparisons, outcomes (PICO) model and the PRISMA (Preferred Reporting Items for Systematic Reviews) criteria. The results reveal that knowledge management and AI are interrelated fields, as both are intensely connected to knowledge; the difference lies in how each relates to it: while AI offers machines the ability to learn, KM offers a platform to better understand knowledge. The research findings further point out that communication, trust, information systems, incentives or rewards, and the structure of an organization are related to knowledge sharing in organizations. This systematic literature review is the first to throw light on KM practices and the knowledge cycle, and on how the integration of AI aids knowledge management systems, enterprise performance, and the distribution of knowledge within the organization. The outcomes offer a better understanding of efficient and effective knowledge resource management for organizational advantage. Future research is necessary on smart assistant systems, which could provide social benefits that strengthen competitive advantage. This study indicates that organizations must take note of definite KM leadership traits and organizational arrangements to achieve stable performance through KM.
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
While strong progress has been made in image captioning in recent years, machine and human captions are still quite distinct. A closer look reveals that this is due to deficiencies in the generated word distribution and vocabulary size, and a strong bias in the generators towards frequent captions. Furthermore,
humans, rightfully so, generate multiple, diverse captions, due to the
inherent ambiguity in the captioning task which is not considered in today's
systems.
To address these challenges, we change the training objective of the caption
generator from reproducing groundtruth captions to generating a set of captions
that is indistinguishable from human generated captions. Instead of
handcrafting such a learning target, we employ adversarial training in
combination with an approximate Gumbel sampler to implicitly match the
generated distribution to the human one. While our method achieves comparable
performance to the state-of-the-art in terms of the correctness of the
captions, we generate a set of diverse captions that are significantly less biased and match the word statistics better in several aspects.
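The approximate Gumbel sampler mentioned above is what lets discriminator gradients reach the discrete word choices. Below is a minimal straight-through Gumbel-softmax sketch; the temperature and epsilon values are our assumptions:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=0.5):
    """Approximately one-hot, yet differentiable, word sample: hard argmax
    on the forward pass, soft Gumbel-softmax gradients on the backward."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)
    y_hard = F.one_hot(y_soft.argmax(dim=-1), logits.size(-1)).float()
    return (y_hard - y_soft).detach() + y_soft  # straight-through estimator
```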